ABSTRACT

Motivation: We face the absence of optimized standards 
to guide normalization, comparative analysis, and interpre- 
tation of data sets. One aspect of this is that current meth- 
ods of statistical analysis do not adequately utilize the in- 
formation inherent in the large data sets generated in a 
microarray experiment and require a tradeoff between de- 
tection sensitivity and specificity. 

Results: We present a multistep procedure for analysis 
of mRNA expression data obtained from cDNA array 
methods. To identify and classify differentially expressed 
genes, results from a standard paired t-test of normalized
data are compared with those from a novel method, 
denoted an associative analysis. This method associates 
experimental gene expressions presented as residuals in 
regression analysis against control averaged expressions 
to a common standard — the family of similarly computed 
residuals for low variability genes derived from control 
experiments. By associating changes in expression of 
a given gene to a large family of equally expressed 
genes of the control group, this method utilizes the large 
data sets inherent in microarray experiments to increase 
both specificity and sensitivity. The overall procedure 
is illustrated by tabulation of genes whose expression 
differs significantly between Snell dwarf mice (dw/dw) 
and their phenotypically normal littermates (dw/+, +/+). Of 
the 2352 genes examined only 450-500 were expressed 
above the background levels observed in nonexpressed 
genes and of these 120 were established as differentially 
expressed in dwarf mice at a significance level that 
excludes appearance of false positive determinations. 
Contact: igor-dozmorov@omrf.ouhsc.edu 

INTRODUCTION 

Analysis of the data from large-scale mRNA expression 
studies is nontrivial due to the complexity and size of 
data sets and the fact that technical variation can be 
introduced at different stages in array production and 
processing. Establishing well specified and carefully
validated procedures for standardization and normalization
of data sets from individual specimens is a key first
step in analysis, but no single method has proven free
from ambiguity. Selection criteria based on the ratio of
measured expression levels fail to account for intra-group
variations (i.e. normal biologic variance) and can result in
false positive selections (Kerr et al., 2000; Dozmorov et
al., 2002). More progressive statistical approaches such as
regression analysis, multidimensional scaling, or principal 
component analysis, have been cogently criticized on a 
number of grounds, including the influence of outliers 
(i.e. genes expressed to different degrees among samples), 
on the parameters of linear regression, principal axis 
choice, and the absence of information about variability 
of individual expression levels within homogeneous 
groups of samples. Nonetheless, attempts at restricting the
influence of outliers and noncorrelated weak signals have
not resulted in the development of recognized standards
(Newton et al., 2001; Wu, 2001).

Additionally, current statistical methods do not adequately
address the competing demands of sensitivity and
specificity. The common practice of using a lenient
significance threshold (P < 0.05) can result in a large
number of false positive selections. This is especially
problematic for high-density arrays, as the number of false
positive selections expected to occur by chance may limit
the ability to perform higher order analyses, such as
molecular pathway identification or disease subphenotyping,
that require groups of differentially expressed genes to be
accurately predicted. Attempts to increase stringency by
requiring P-values below a smaller threshold can also be
problematic, as this causes a compensatory decrease in
sensitivity and a resultant increase in false negative
selections. The use of large numbers of replicates can
improve this situation (Glynne et al., 2000), although it
can be expensive and labor intensive. Herein we describe a
novel statistical
method of comparative analysis of cDNA array data. 
The method, denoted 'associative analysis', supplements 
the standard procedure of multiple paired comparisons 
by associating the expression level of each gene in the
experimental group with a family of similarly and stably
expressed genes in the control group. This associative
analysis enhances the sensitivity of selections beyond
previously described modifications of the t-test (Miller
et al., 2001) and increases the number of differentially
expressed genes identified without significantly increasing
the number of false positives.

In our previous publications (Dozmorov et al., 2001, 
2002), some aspects of these normalization procedures 
have been applied to identification of differentially 
expressed genes in mice of the dwarf genotype (Ames,
homozygous for the Prop1^df mutation, and Snell,
homozygous for the Pit1^dw mutation). Dwarf mice
demonstrate similar deficiencies in pituitary function
leading to decreased production of growth hormone,
prolactin and thyroid-stimulating hormone and severe
alterations in gene expression profiles relative to wild
type mice (Pfaffle et al., 1999). Herein, for the first time
we apply the full suite of statistical procedures discussed 
above to these data sets and fully delineate the methods 
such that they can be assessed and employed by other 
groups. 

SYSTEM AND METHODS 
Experimental methods 

This work uses the same raw experimental data and
materials as a previous publication (Dozmorov et al.,
2002). That study screened a commercial array of 2352
mouse genes with cDNA derived from the livers of Snell
dwarf mice and their sibling controls. Eight 6-month-old
male homozygous Snell dwarf mice (dw/dw) and eight
normal age- and sex-matched siblings (dw/+ or +/+) were
studied. The raw data are available at
http://www.omrf.ouhsc.edu/OMRF/Research/09/DozmorovI.asp.

Tissue collection, preparation of RNA and cDNA
probes, and hybridization were done as described
(Dozmorov et al., 2002).

Outline of normalization and analysis procedures 

1. Normalization of each expression profile to its own
background, with selection of the genes expressed
above background for subsequent adjustment
and comparison. Note: genes are classed as 'expressed'
when their values cannot be associated with the
representative homogeneous family of normally
distributed background-level values (Figure 1).

2. Adjustment of the normalized profiles to each other
by robust regression analysis of genes expressed
above background. In this analysis potential outliers
are identified and their contribution to the
calculations is down-weighted in an iterative manner,
diminishing or excluding their influence (Figure 2).
All expression profiles of both control and experimental
groups are then rescaled to a common standard, the
averaged profile of the control group.
An alternative procedure for outlier exclusion is
based on the selection of equally expressed genes
as a homogeneous family of genes with normally
distributed residuals, measured as deviations from
the regression line calculated against the averaged
profile (Figure 3). Outliers are thereafter identified
as genes whose deviations cannot be associated with this
normal distribution, which is represented by several
hundred members.

3. Identification of a group of similarly expressed
genes from control experiments, denoted the 'reference
group' (Figure 3), to be used for statistical analysis
of differentially expressed genes using an associative
t-test (below). The reference group is composed
of genes expressed above background levels with
normal low variability of expression in control
samples as determined by t-test, and whose
residuals approximate a normal distribution, based
on the Kolmogorov-Smirnov criterion.

4. Identification of genes differentially expressed in 
experimental vs control groups using three distinct 
statistical approaches (Figure 4). These analyses 
include: 

- Selection of differentially expressed genes using a
paired t-test (separate tests for a pair of replicates of
each gene in the control and experimental groups)
and the commonly accepted significance threshold
of P < 0.05. A significant proportion of the genes
identified as differentially expressed will be false
positive determinations at this threshold level.

- A t-test using a Bonferroni correction of the
significance threshold, which effectively eliminates
false positive determinations but with a simultaneous
loss of sensitivity, resulting in an increased proportion
of false negative determinations.

- An associative t-test in which the replicated residuals
for each gene of the experimental group are compared
with the entire set of residuals from the reference
group defined above. The null hypothesis tested is that
the expression of the gene in the experimental group is
associated with the reference group defined in step 3.
The significance threshold is corrected so that
false positive determinations are improbable.

- Comparison of the selections from the paired t-test
and the associative t-test to classify the differentially
expressed genes identified as: (a) likely false positives
(genes selected as differentially expressed by the
paired t-test with P < 0.05, but not by the associative
t-test); (b) real positives (selected in both tests);
and (c) potential positives (genes selected in the
associative test only).

Fig. 1. Normalization of the gene expression profile to its own background. (a) A histogram showing the expression levels of the 1176
cDNA targets derived from the liver of normal control mice. These values conform poorly to a normal distribution, with extended upper and
lower tails apparent. Values in the lower tail (typically negative) result from the background correction procedure. Values in the upper tail
correspond to genes expressed above background in a given sample. (b) Plot of the values expected under the proposed normal distribution vs the real
levels of expression. The straight line is a regression line for the central part of the plot, which is predominantly background noise. To identify the
parameters of the normal distribution for background, data are sorted in ascending order and, as a first approximation, the mean and SD estimates
are computed for all spots. Spots at the high end and at the low end of this distribution are then discarded one by one in an alternating manner
if they exceed a criterion set 2 SDs beyond the mean of the remainder of the distribution. The resulting set of nondiscarded points (typically
between 500 and 600 of the initial set of 1176) represents the fragment of normally distributed background values. This fragment is then used
for the accurate estimation of the parameters of the normal distribution for background using a standard minimization procedure (Figure 1c).

The mean (Av) and SD of the normally distributed background spots are used to normalize the raw intensity S as S' = (S - Av)/SD.

The distribution of S' (Figure 1d) has a mean of zero and SD = 1 over the set of background genes. The curve shows the distribution of these
nonexpressed genes. The threshold 3SD = 3 was used for selection of genes expressed above background.
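
As an illustration only, and not the software used in this work, the background estimation described in the legend to Figure 1 can be sketched in Python as follows. The function names and the parameters k and threshold are assumptions introduced for this sketch. The final minimization step (Figure 1c) is replaced here by the plain mean and SD of the retained fragment, which tends to underestimate the SD because the tails have been trimmed.

```python
import math


def mean_sd(values):
    """Sample mean and standard deviation (n - 1 denominator)."""
    n = len(values)
    mean = math.fsum(values) / n
    var = math.fsum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var)


def estimate_background(intensities, k=2.0):
    """Trim sorted spots alternately from the high and low ends.

    A spot is discarded when it lies more than k SDs from the mean of the
    remaining spots; trimming stops when neither extreme qualifies.
    Returns (mean, sd, n_kept) for the retained background fragment.
    """
    core = sorted(intensities)
    prefer_high = True
    while len(core) > 3:
        for from_high in (prefer_high, not prefer_high):
            candidate = core[-1] if from_high else core[0]
            rest = core[:-1] if from_high else core[1:]
            mean, sd = mean_sd(rest)
            if abs(candidate - mean) > k * sd:
                core = rest
                prefer_high = not from_high  # alternate between the two ends
                break
        else:
            break  # neither extreme is an outlier: stop trimming
    mean, sd = mean_sd(core)
    return mean, sd, len(core)


def normalize_to_background(intensities, threshold=3.0):
    """Return S' = (S - Av) / SD and the indices of spots with S' > threshold."""
    av, sd, _ = estimate_background(intensities)
    scaled = [(s - av) / sd for s in intensities]
    expressed = [i for i, z in enumerate(scaled) if z > threshold]
    return scaled, expressed
```

Applied to one array, normalize_to_background returns the normalized profile together with the spots that would be carried forward as 'expressed' under the 3 SD criterion.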

RESULTS 

Comparative analysis of gene expression in the experimental
group begins by applying the procedures of normalization
to background and rescaling described above.
Averaged data from the control group are used as a standard
for data rescaling. Adjusting data from the experimental
group to the averaged control data produces residuals of the
same order for equally expressed genes and highlights
the genes with extreme expression deviations (Figure 3c).
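
The rescaling step can be illustrated with a minimal sketch of an iteratively reweighted regression. It is not the Number Cruncher Statistical System routine used in this work: the weights 1/|residual| are one common way to approximate a least absolute deviations fit, and the function names, the eps floor and the default of twenty cycles are assumptions made for this example. Inputs are taken to be log-scale values for genes expressed above background.

```python
import math


def weighted_line(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = math.fsum(w)
    mx = math.fsum(wi * xi for wi, xi in zip(w, x)) / sw
    my = math.fsum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = math.fsum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = math.fsum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def robust_line(x, y, cycles=20, eps=1e-6):
    """Refit repeatedly, down-weighting points with large residuals."""
    w = [1.0] * len(x)
    intercept, slope = 0.0, 1.0
    for _ in range(cycles):
        intercept, slope = weighted_line(x, y, w)
        # Weight 1/|residual| approximates a least absolute deviations fit.
        w = [1.0 / max(abs(yi - intercept - slope * xi), eps)
             for xi, yi in zip(x, y)]
    return intercept, slope


def rescale_to_standard(profile, standard):
    """Rescale a profile so that its robust fit against the standard is y = x."""
    intercept, slope = robust_line(standard, profile)
    return [(p - intercept) / slope for p in profile]
```

After rescaling, equally expressed genes lie close to the identity line against the averaged control profile, so their residuals can be compared directly across samples.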

Single gene comparisons: paired t-test. The paired t-test
evaluates the difference between the means of each single
gene expression in two groups, employing the variance
within groups as an error term. The use of the usual
threshold P = 0.05 for the selection of differentially
expressed genes will result in a significant proportion of
false positive selections from experiments with thousands
of elements, as is the case in array experiments. When
using the Atlas 1.2 array set, about 50 false positive
selections can be expected at this threshold if all genes
are analyzed. This number can be substantially decreased
if the analysis excludes genes that are not expressed in
both groups. Approximately 250 genes were determined
to be expressed in the experiments described herein.
The number of false positive determinations expected
in this group at P = 0.05, which is 12 to 13, still
represents a significant portion of the total number of
differentially expressed genes identified. Use of replicates
results in a decrease of the proportion of false negative
determinations, though the proportion of false positives
remains relatively stable at around one third of all positive
selections (Figure 4a and b). It is possible to decrease this
proportion through the use of a corrected P-value.

Fig. 2. Comparison of liver samples of two normal mice (Atlas
I arrays as in Figure 1c). Each data set (S1 and S2) has been
normalized with respect to its own set of background genes, as
explained above. The values are shown on a logarithmic scale, and
only 'expressed above background' values where S > log(3) are
included. Differentially expressed genes can be identified as those
whose ratio of expression in two control samples does not fall on or
close to the line describing similarly expressed genes (filled circles
in Figure 2a). These genes, denoted 'outliers', were excluded
from rescaling by use of a robust regression procedure in which
the influence of outliers is down-weighted in a series of regression
procedures (software: Number Cruncher Statistical System, Utah,
2001) with an influence function based on the use of least absolute
deviations and with twenty successive cycles of regression
parameter estimation. The resulting plot for completely adjusted
distributions is presented in Figure 2b, with the final regression line
passing through the origin at an angle of 45° (slope of 1).



Single gene comparison: Bonferroni t-test. The Bonferroni
correction is the most common method employed to
reduce the proportion of false positive determinations in
multiple comparison analysis and it has been applied to
array data (Miller et al., 2001). In this method the stringency
of the threshold P is increased to 0.05/(the number
of compared values). For the expressed genes identified
above, P = 0.05/250 = 2 x 10^-4. This more stringent threshold
produces a new selection of differentially expressed
genes with the complete absence of false positive
determinations (Figure 4c). While specificity is increased in
this analysis, sensitivity is sacrificed and a large number of
false negatives (type II errors) are obtained. All selections
obtained with the Bonferroni t-test are also present within the
selections made in the associative comparison (see below).
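
The counts quoted for the uncorrected and Bonferroni-corrected thresholds follow from simple arithmetic, sketched below for roughly 250 expressed genes. The function names are introduced here for illustration, and the last function additionally assumes that the tests are independent, which is an idealization for array data.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def prob_any_false_positive(n_tests, alpha):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests
```

With 250 tests, the uncorrected threshold of 0.05 gives an expectation of 12.5 false positives, whereas the corrected threshold of 0.05/250 keeps the chance of even a single false positive below 0.05.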

Associative comparison. It is possible to substitute the
typical paired comparison of gene expressions between
control and experimental groups with the comparison
of their residuals. In this analysis it is determined if
a given gene of the experimental group belongs to (or
can be associated with) the reference group (as defined
in the Outline of normalization and analysis procedures,
step 3). Denoted an associative t-test, it is actually a
standard Student t-test applied to the comparison of
expression deviations.
An associative t-test dramatically increases the power of
comparisons relative to a paired t-test. In the data analyzed
herein, this is due to the fact that eight replicates from the
experimental group are compared with several hundred values of
the reference group. As a result a large number of positive
determinations can be obtained with stringent thresholds
(Figure 4d).
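
A minimal sketch of such a comparison is given below, assuming a pooled-variance two-sample Student statistic between the replicate residuals of one gene and the residuals of the reference group. The function name is hypothetical. The P-value uses a normal approximation to the t distribution, which is an assumption of this sketch that is reasonable only because the reference group supplies several hundred degrees of freedom.

```python
import math
from statistics import NormalDist, fmean, variance


def associative_t_test(gene_residuals, reference_residuals):
    """Compare one gene's replicate residuals with the reference group.

    Returns (t, p): the pooled-variance two-sample t statistic and a
    two-sided P-value from a normal approximation (large degrees of freedom).
    """
    n1, n2 = len(gene_residuals), len(reference_residuals)
    pooled_var = ((n1 - 1) * variance(gene_residuals)
                  + (n2 - 1) * variance(reference_residuals)) / (n1 + n2 - 2)
    diff = fmean(gene_residuals) - fmean(reference_residuals)
    t = diff / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p
```

Where exact P-values from the t distribution are needed, a statistics library can be substituted for the approximation, for example scipy.stats.ttest_ind in SciPy.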

By comparing the results of these two tests, the paired
t-test with threshold P < 0.05 and the associative t-test
with threshold P < 0.005 (P < 1/n, where n =
number of genes analyzed from the experimental group),
differentially expressed genes can be classified into three
groups. Genes defined as differentially expressed by the
paired t-test but not by the associative t-test are likely
false positives. Genes identified in both analyses are likely
real positives, which also include the small sub-group of
genes selected by the Bonferroni t-test. Genes identified only
in the associative analysis are potentially real positives that
require additional replicates to confirm.
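
The three-way classification can be written out as a short rule. The function name and the label strings are assumptions made for this sketch; the default thresholds are the ones quoted in the text.

```python
def classify_gene(p_paired, p_associative,
                  paired_alpha=0.05, associative_alpha=0.005):
    """Classify one gene from its paired and associative t-test P-values."""
    by_paired = p_paired < paired_alpha
    by_associative = p_associative < associative_alpha
    if by_paired and by_associative:
        return "real positive"
    if by_paired:
        return "likely false positive"
    if by_associative:
        return "potential positive"
    return "not selected"
```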

We have used this analysis to identify genes that
are differentially expressed between normal and dwarf
mice and found 46 genes overexpressed in Snell dwarf
mice; 49 genes expressed only in Snell mice; 12 genes
overexpressed in normal control mice; and 13 genes expressed
only in normal mice (Tables S1 A-D in the Supplementary
section, http://www.omrf.ouhsc.edu/OMRF/Research/09/Dozmorovl.asp).
Of these selected genes, 71 have previously been reported as
differentially expressed in Snell dwarf mice, associated with
dwarfism, or strongly associated with a similar hormonal
status. An additional 10 selections, whose relevance to
dwarfism or similar hormonal status is supported by the
indicated references (Table 2S), were obtained by the new
method and not by the previous analysis (Dozmorov et al.,
2002). In addition, this new method was able to more
correctly predict the expression levels of 11 genes verified
by RT-PCR in the previous publication (Dozmorov et al., 2002).

Fig. 3. Deviations of gene expression after rescaling to the averaged data in the normal mice group (8 mice). (a) Variability of genes within the
homogeneous control group. (b) The same data after exclusion of hypervariable genes with an SD statistically higher than that of the homogeneous
control group (based on an F-criterion). (c) Deviation from normal control averages of gene expressions in dwarf mice samples.

Fig. 4. Sensitivity and specificity of statistical comparison. Numbers of genes with statistically different expression in dwarf mice compared
with their normal siblings were selected from 256 expressed genes presented on the Atlas-I membrane using three different criteria: paired
t-test (P < 0.05), Bonferroni t-test (P < 0.05/256), and associative t-test (P < 0.0025). Positive, false positive and false negative
selections are shown with different fill patterns as indicated.



DISCUSSION 

We describe herein a useful and practical multistep proce- 
dure to analyze gene expression data from a cDNA array. 
The method is novel in that it: provides a robust means 
of normalizing one channel data using an internal stan- 
dard; establishes a more precise procedure for data scal- 
ing by reducing the influence of outliers upon calcula- 
tion of scalars; increases the sensitivity of differential gene 
identification without loss of specificity; and allows dif- 
ferentially expressed genes to be classified into distinct 
groups of probabilistically known or suspected differen- 
tial expression. 



We demonstrate here an opportunity to increase the 
power of statistical analysis using representative standards 
for selection of potential outliers. This general procedure 
is done three times in these analyses. The first representa- 
tive standard is the family of genes whose hybridization 
signals are at or below the background level. Outliers 
from this standard are defined as 'expressed genes'. The 
second representative standard is the family of normally 
distributed residuals of equally expressed genes of the 
control group. Outliers from this group are hypervariable 
and differentially expressed genes that must be excluded 
from regression analysis for proper adjustment of pairs
of profiles under comparison. The third representative
standard is the family of genes with low variability within 
replicate control samples. There are two types of outliers 
from this standard — hypervariable genes of the control 
group (which were excluded to create this standard) 
and differentially expressed genes of the experimental 
group — whose identification is the main goal of these 
analyses. 
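
Construction of the third standard can be sketched as follows, under stated assumptions. The F-criterion used in this work to exclude hypervariable genes is replaced here by a simple variance-ratio cutoff (the hypothetical parameter max_ratio), and normality is summarized by the Kolmogorov-Smirnov distance to a normal distribution fitted to the pooled residuals. All names are introduced for this example only.

```python
import math
from statistics import NormalDist, fmean, stdev, variance


def select_reference(residuals_by_gene, max_ratio=3.0):
    """Iteratively drop genes whose variance exceeds max_ratio x the pooled mean.

    residuals_by_gene maps a gene name to its residuals across the control
    replicates. The ratio cutoff stands in for a formal F-criterion.
    """
    variances = {g: variance(r) for g, r in residuals_by_gene.items()}
    keep = set(variances)
    while True:
        pooled = fmean([variances[g] for g in keep])
        drop = {g for g in keep if variances[g] > max_ratio * pooled}
        if not drop:
            return keep
        keep -= drop


def ks_statistic(values):
    """Kolmogorov-Smirnov distance between the sample and a fitted normal."""
    xs = sorted(values)
    n = len(xs)
    dist = NormalDist(fmean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        c = dist.cdf(x)
        d = max(d, (i + 1) / n - c, c - i / n)
    return d
```

As a rough guide, the distance can be compared with the asymptotic 5% critical value 1.36/sqrt(n); because the mean and SD are estimated from the same data, that comparison is conservative.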

The necessity of initially separating expressed from
nonexpressed genes before comparison was demonstrated
herein with data obtained from Snell mice using Clontech
Atlas arrays in which 600 genes were spotted in duplicate.
Since two independent signals are measured for each gene 
on the membrane, the variation in intensity between the 
duplicated spots for a given gene can be used to assess 
signal reproducibility. If variation were independent of 
signal intensity the ratio of variation between duplicate 
spots would be distributed around 1 with small random 
variations. However, this was not observed for genes 
expressed below some threshold signal intensity. It is of 
note that this threshold corresponds to our determination 
of background. It is not clear if this so-called background 
threshold is due to technical limitations of measuring 
signal intensity on the array or if it is a real biologic 
threshold defined by genes that are not expressed. Operationally,
however, the addition of this exclusion criterion
provides a logical cutoff between noncorrelative and
correlative data and therefore improves the reliability
of the comparative analysis carried out after this step. A
similar conclusion has been presented (Newton et al.,
2001; Khan et al., 2001), though with the use of arbitrary
exclusion criteria. While these exclusion criteria improve
the homogeneity of selections made using ratios, their
arbitrariness can be associated with loss of useful information
about low abundance genes that can play an important
role in regulatory biological processes.

We accomplish further enrichment of reliability by utilizing
and extending previously published theories on signal
variation. There are different sources for fluctuations in
residuals. Technological variations represent a random
component of deviation and are therefore common to
all expressions. Some publications demonstrate the
dependence of technological fluctuations on the level of
gene expression, and a resultant nonnormal distribution
of these values (Kerr et al., 2000; Newton et al., 2001).
The two main sources of heterogeneity in gene expression
variations are indicated in Rocke and Durbin (2001) as
the 'additive component', prominent at low expression
levels, and the 'multiplicative component', prominent at
high expression levels. The intensity measurement $y_{ij}$ for
gene $i \in I = \{i_1, \ldots, i_n\}$ in sample $j \in J = \{j_1, \ldots, j_m\}$
is modeled by the equation (Rocke and Durbin, 2001; Zien
et al., 2001)

$y_{ij} = a_{ij} + \mu_{ij} e^{\eta_{ij}} + \varepsilon_{ij}$

where $a$ is the normal background (independent of expression
level), $\mu$ is the expression level in arbitrary units,
$\varepsilon$ is the first error term (additive), which represents
the standard deviation of background, and $\eta$ is the second
error term, which represents the proportional error
(multiplicative). The first error term is excluded in our
analysis by eliminating expression values at or below
background levels. The second error term is transformed from
multiplicative (and therefore expression-dependent, increasing
in proportion to expression level; Lee et al., 2000) into
additive (and therefore expression-independent) by
log-transformation of data as in Rocke and Durbin (2001):
$\log(y) = \log(\mu) + \eta$, where $\eta$ is the residual for
log-transformed data. The independence of $\eta$ from
individual gene expressions is proven by the vendor (Atlas
Manual, 2000) and confirmed with the Kolmogorov-Smirnov
normality test in our experiments.
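
The effect of the log-transformation on the multiplicative error term can be illustrated with a small simulation of the two-component model. The parameter values (background a, sd_eps, sd_eta) and function names are arbitrary assumptions chosen for this sketch, and the background is treated as known so that it can be subtracted before taking logarithms.

```python
import math
import random
from statistics import stdev


def simulate_intensities(mu, n, rng, a=50.0, sd_eps=10.0, sd_eta=0.2):
    """Draw n values of y = a + mu * exp(eta) + eps for one expression level."""
    return [a + mu * math.exp(rng.gauss(0.0, sd_eta)) + rng.gauss(0.0, sd_eps)
            for _ in range(n)]


def log_residual_sd(intensities, a, mu):
    """SD of log(y - a) - log(mu), the residual after log-transformation."""
    return stdev([math.log(y - a) - math.log(mu) for y in intensities])


if __name__ == "__main__":
    rng = random.Random(3)
    for mu in (1000.0, 10000.0):
        ys = simulate_intensities(mu, 5000, rng)
        print(mu, round(stdev(ys), 1), round(log_residual_sd(ys, 50.0, mu), 3))
```

On the raw scale the spread grows roughly in proportion to the expression level, whereas the log-scale residual keeps an SD close to sd_eta at both levels, which is the expression-independence that the associative comparison relies on.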

Not surprisingly, we have found that the number of 
repetitions is critical in achieving adequate specificity 
(low false positives) and sensitivity (low false negatives). 
Due to the stochastic character of the above-mentioned 
fluctuations, replication and averaging is a sensible 
method to reduce the noise level. Only those transcripts 
that are truly altered by an experimental factor will have 
a reproducible change and become more statistically 
significant with repetition; those changes that result from 
noise will not become more significant with repetition. 
Thus, sensitivity increases with repetition at a fixed 
specificity. Both the paired t-test and the associative t-test
demonstrate similar improvement in sensitivity through
replication. However, the specificity of the paired t-test
remains unchanged when using from 4 to 8 replicates.
This is due to the necessity of using conservative
methods to protect from false positive determinations
when using the paired t-test. These methods result in the
loss of information about the majority of false negative
expression differences. This information, once lost, is not
regained through additional replicates. In the associative
t-test, selections are made at a significance threshold
stringent enough to exclude the appearance of false positive
determinations. However, the number of comparisons
made between a given experimental gene and the family
of similarly expressed genes in the control condition
assures that few false negative determinations will occur.
Increased repetition can therefore be used to enhance the
overall statistical significance of the selections made using
this method (Figure 4). Confirmation of the increased
sensitivity of this method was obtained from a literature
search of genes whose expression has been shown to be
different in Snell mice and related model systems. At
the level of sensitivity with less than one false positive
determination, the associative method selects a larger
number of differentially expressed genes documented
in the literature to have links with dwarfism or similar
abnormalities in hormonal status than previous methods
utilizing a paired analysis. Only half (around 30) of the
differences obtained by microarray studies that utilized
a standard paired analysis (Dozmorov et al., 2002) were
confirmed in the current analysis. Importantly, only those
genes confirmed by the associative method have been
shown to be related to a premature aging phenotype
in empirical studies, suggesting the methods described
herein do indeed increase the specificity of differential
gene identification.

The associative method also enhances the information 
obtained from microarray experiments beyond common 
approaches because it discriminates genes that
are differentially expressed from those that are expressed
only in one state. For example, Calgranulin B has been
shown previously by RT-PCR to be undetectable in
normal mice, as predicted by the method described herein,
yet was selected as differentially expressed in a previous
analysis utilizing only a standard paired comparison
(Dozmorov et al., 2002). Procedures similar to associative
analysis have been previously proposed (Newton et al.,
2001; Rocke and Durbin, 2001). However, there are
critical differences between these methods and ours. For
example, in Rocke and Durbin (2001) all genes were
used as a reference group without excluding hypervariable 
genes. The presence of hypervariable genes increases the 
standard deviation of the reference group thus reducing the 
power of the associative analysis. Moreover, inclusion of 
hypervariable genes results in a nonnormally distributed 
reference group preventing the use of common statistical 
tests. 

By testing the hypothesis of association of any potential 
outlier with a large representative standard (typically sev- 
eral hundred elements) the statistical power of the analysis 
is increased over that achieved with traditional single gene 
comparisons, which are powered only by the numbers of 
replicates. The higher power of the associative test, thus, 
increases sensitivity without loss of specificity. When used 
in combination with a traditional paired analysis, this
increased statistical power also allows the use of traditional
low level significance cutoffs in the standard paired
analysis (P < 0.05) without the risk of including false positive
selections. The associative analysis is therefore based on
an idea opposite to the commonly held view that large-scale
array experiments suffer from compensatory tradeoffs
in sensitivity and specificity. In fact the procedures
presented herein demonstrate that large-scale data sets are
extraordinarily information rich and provide a means for
discriminating common technical variation from individual
biological variability.